-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
per-TenantShard read throttling #6706
Conversation
2442 tests run: 2325 passed, 0 failed, 117 skipped (full report)Code coverage (full report)
The comment gets automatically updated with the latest test results
23baada at 2024-02-16T19:29:12.761Z :recycle: |
…tle now costs 2.7% of CPU time
empty neon_local release build, on i3en.3xlarge instance store create 50 empty tenants configure page cache size & fd limits so everything stays within PS process and configure rate limit with 5000 IOPS: page_cache_size = 1638400 max_file_descriptors = 500000 [tenant_config.timeline_get_rate_limit] fair = true initial = 0 interval_millis = 100 interval_refill = 500 max = 5000 task_kinds = [ "PageRequestHandler" ] Overhead with throttle kicking in: --------------------------------- ./target/release/pagebench get-page-latest-lsn --limit-to-first-n-targets 1 --per-client-rate 100 --num-clients 500 shows as expected that 90% of requests get dropped because above throttle; 2024-02-14T13:30:12.505501Z INFO RPS: 4995 MISSED: 45453 CPU overhead: sudo perf record -p "$(pgrep pageserver)" -F666 --call-graph=dwarf,65528 -a sleep 4 && sudo perf script | ./stackcollapse-perf.pl| ./flamegraph.pl > flamegraph.svg => search result for "throttle"; 1.6% of CPU time. Overhead while below throttle: ----------------------------- ./target/release/pagebench get-page-latest-lsn --limit-to-first-n-targets 50 --per-client-rate 400 --num-clients 1 2024-02-14T13:41:34.811237Z INFO RPS: 19588 MISSED: 434 CPU overhead: sudo perf record -p "$(pgrep pageserver)" -F666 --call-graph=dwarf,65528 -a sleep 4 && sudo perf script | ./stackcollapse-perf.pl| ./flamegraph.pl > flamegraph.svg => search result for "throttle"; 0.9% of CPU time
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it should work. The vast configuration options feels a bit like extra, but we can hopefully get to throttle those down in future. As far as I understood the perf is looking good.
Few nits for your consideration, maybe it's better to document on the Throttle that it will be shared between timelines instead everywhere else. At least that would had clarified for me why the arc-swap.
the Throttle has nice testable design, sadly I don't think it should get any tests because it just forwards to leaky_bucket.
… metrics + regression test (#6953) part of #5899 Problem ------- Before this PR, the time spent waiting on the throttle was charged towards the higher-level page_service metrics, i.e., `pageserver_smgr_query_seconds`. The metrics are the foundation of internal SLIs / SLOs. A throttled tenant would cause the SLI to degrade / SLO alerts to fire. Changes ------- - don't charge time spent in throttle towards the page_service metrics - record time spent in throttle in RequestContext and subtract it from the elapsed time - this works because the page_service path doesn't create child context, so, all the throttle time is recorded in the parent - it's quite brittle and will break if we ever decide to spawn child tasks that need child RequestContexts, which would have separate instances of the `micros_spent_throttled` counter. - however, let's punt that to a more general refactoring of RequestContext - add a test case that ensures that - throttling happens for getpage requests; this aspect of the test passed before this PR - throttling delays aren't charged towards the page_service metrics; this aspect of the test only passes with this PR - drive-by: make the throttle log message `info!`, it's an expected condition Performance ----------- I took the same measurements as in #6706 , no meaningful change in CPU overhead. Future Work ----------- This PR enables us to experiment with the throttle for select tenants without affecting the SLI metrics / triggering SLO alerts. Before declaring this feature done, we need more work to happen, specifically: - decide on whether we want to retain the flexibility of throttling any `Timeline::get` call, filtered by TaskKind - versus: separate throttles for each page_service endpoint, potentially with separate config options - the trouble here is that this decision implies changes to the TenantConfig, so, if we start using the current config style now, then decide to switch to a different config, it'll be a breaking change Nice-to-haves but probably not worth the time right now: - Equivalent tests to ensure the throttle applies to all other page_service handlers.
Slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1707499237721679
Problem
part of epic #5899
Specifically, RFC #5648
tl;dr: we currently serve GetPage@LSN requests as fast as possible.
This sets the wrong incentives and poses operational and economic problems.
Summary of changes
Mechanism:
leaky-bucket
Timeline::get
TaskKind
s;PageRequestHandler
Timeline::get
requests from basebackup, so, it might slow down basebackup.The rationale for placing the throttle at the top of Timeline::get is, and for the TaskKind selector, is that it enables experimentation. E.g., we can also use it to rate limit compaction.
Observability:
Timeline::get
calls, as well as sum of wait time.compaction_period
seconds. Would like to have a better-defined reporting period, but, that would require a separate background task. If we add that in the future, we should move the walredo quiescing there, too.Example
Live reconfigure a tenant like so:
Testing & Performance
I made changes to pagebench so it issues requests at a defined rate.
If a request takes too long, it waits client-side until the next request at regular frequency would have been issued before sending the next request.
It prints the number of requests that weren't sent as
MISSED
in the live stats.I then used
pagebench
to benchmark the throttling code as followsMeasure CPU contribution of throttling using process-scoped flamegraph, searching for throttling's contribution by searching the flamegraph for
throttle
:Workload 0: 500 connections against one tenant, throttling disabled. 0.3% CPU time
Flame Graph
![2024-02-16 no throttle configured](https://github.com/neondatabase/neon/assets/956573/d7fa463d-f04d-48ba-82aa-619f0fa6c140)Workload 1: 500 connections against one tenant, target rate 10x above throttle: 1.3% cpu time
TODO: debate this workload giving the skipping behavior of pagebench. Should the client instead go as fast as possible to determine CPU overhead?
Flame Graph
![10x](https://github.com/neondatabase/neon/assets/956573/695148fc-7cbd-489d-9a30-0c5c053baa23)Workload 2: single client per tenant, request rate below limit, run in parallel against each tenant to increase sampling coverage (all way below 100% usage, not CPU bound). 1.1% cpu time.
Flame Graph
![en par](https://github.com/neondatabase/neon/assets/956573/b42f22f0-34f6-4783-a3a4-8c39140cb03c)TODOs
page_service.rs
metrics should mean on a rate limited tenantTimeline::get
Future Work
Non-Goals
As pointed out in the RFC, this is a mechanism to prevent noisy-neighbor problems in pageserver. It's not a product feature / something we'll sell. The mechanism will be used to put an artificial ceiling on how many operations a single TenantShard can serve.